Learning String Edit Distance

Authors

  • Eric Sven Ristad
  • Peter N. Yianilos
Abstract

In many applications, it is necessary to determine the similarity of two strings. A widely used notion of string similarity is the edit distance: the minimum number of insertions, deletions, and substitutions required to transform one string into the other. In this report, we provide a stochastic model for string edit distance. Our stochastic model allows us to learn a string edit distance function from a corpus of examples. We illustrate the utility of our approach by applying it to the difficult problem of learning the pronunciation of words in conversational speech. In this application, we learn a string edit distance with one fourth the error rate of the untrained Levenshtein distance. Our approach is applicable to any string classification problem that may be solved using a similarity function against a database of labeled prototypes.
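As a point of reference for the definition above, the following is a minimal sketch of the untrained, unit-cost Levenshtein distance and of nearest-prototype classification with it. It is not the stochastic model learned in the report; the function names `levenshtein` and `classify` and the example prototypes are assumptions made for illustration only.

```python
from typing import List, Tuple


def levenshtein(source: str, target: str) -> int:
    """Unit-cost edit distance (insert, delete, substitute) by dynamic programming."""
    # previous[j] holds the distance between the processed prefix of `source`
    # and the first j characters of `target`.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # distance from source[:i] to the empty string
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def classify(query: str, prototypes: List[Tuple[str, str]]) -> str:
    """Return the label of the prototype closest to `query` (hypothetical helper)."""
    nearest = min(prototypes, key=lambda pair: levenshtein(query, pair[0]))
    return nearest[1]


if __name__ == "__main__":
    # Illustrative prototypes only; not data from the report.
    labeled = [("color", "A"), ("collar", "B")]
    print(levenshtein("kitten", "sitting"))
    print(classify("colour", labeled))
```

A learned distance, as described in the abstract, would replace the fixed unit costs with parameters estimated from a corpus of example string pairs; the classification step against labeled prototypes would keep the same shape.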

Similar resources

Learning String Edit Distance

In many applications, it is necessary to determine the similarity of two strings. A widely-used notion of string similarity is the edit distance: the minimum number of insertions, deletions, and substitutions required to transform one string into the other. In this report, we provide a stochastic model for string edit distance. Our stochastic model allows us to learn the optimal string edit dis...

MAUL: Machine Agent User Learning

We describe implementation of a classifier for User-Agent strings using Support Vector Machines. The best kernel is found to be the linear kernel, even when more complicated string based kernels, such as the edit distance kernel and the subsequence kernel, are employed. A robust tokenization scheme is employed which dramatically speeds up the calculation for the edit string and subsequence kern...

Learnable Similarity Functions and Their Applications to Record Linkage and Clustering

Many machine learning tasks require similarity functions that estimate likeness between observations. Similarity computations are particularly important for clustering and record linkage algorithms that depend on accurate estimates of the distance between datapoints. However, standard measures such as string edit distance and Euclidean distance often fail to capture an appropriate notion of sim...

Learning Balls of Strings from Edit Corrections

When facing the question of learning languages in realistic settings, one has to tackle several problems that do not admit simple solutions. On the one hand, languages are usually defined by complex grammatical mechanisms for which the learning results are predominantly negative, as the few algorithms are not really able to cope with noise. On the other hand, the learning settings themselves re...

SEDiL: Software for Edit Distance Learning

In this paper, we present SEDiL, a Software for Edit Distance Learning. SEDiL is an innovative prototype implementation grouping together most of the state-of-the-art methods [1-4] that aim to automatically learn the parameters of string and tree edit distances. This work was funded by the French ANR Marmota project, the Pascal Network of Excellence and the Spanish research programme Consolide...

Publication date: 1997